Session #1: A Conceptual Overview of Data Science

September 11, 2018

Simple Goal

Give food for thought.

Roadmap

Part 1: A Conceptual Overview

Part 2: Programming Refresher

Applications: Rec Engines

Amazon’s recommendation engines examine past purchase behavior to personalize and surface products to relevant consumers.

Applications: Satellite imagery

Orbital Insights deploys computer vision algorithms to count cars at retail store parking lots in order for investors to better approximate quarterly earnings.

Applications: Email receipts

Quandl used a panel of email receipts to predict the effects of an Uber platform charge policy.

Applications: Pricing

Zillow collected most of the housing sales data in the US and constructed a price prediction model to help sellers price their homes more competitively when they put a property on the market.

Applications: Fire Prediction

FDNY, among other city fire departments, trains algorithms to predict where fires will occur in order to target fire safety inspections.

Applications: Anomaly Detection

Advanced time series modeling can be used to detect anomalous activity so that online platforms can safeguard their assets.
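
For intuition, here is a minimal sketch of anomaly detection in Python: a rolling z-score with made-up activity numbers. Real platforms use far more advanced time series models; this only illustrates the idea of flagging observations that deviate sharply from recent history.

```python
from statistics import mean, stdev

def zscore_anomalies(series, window=5, threshold=3.0):
    """Flag points that deviate strongly from a trailing window.

    Returns the indices of observations whose z-score against the
    preceding `window` points exceeds `threshold`.
    """
    anomalies = []
    for t in range(window, len(series)):
        past = series[t - window:t]
        mu, sigma = mean(past), stdev(past)
        if sigma > 0 and abs(series[t] - mu) / sigma > threshold:
            anomalies.append(t)
    return anomalies

# A mostly flat series with one spike at index 7
activity = [10, 11, 9, 10, 10, 11, 10, 50, 10, 9]
print(zscore_anomalies(activity))  # → [7]
```

The spike at index 7 is the only point more than three standard deviations from its trailing window, so it is the only one flagged.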

What’s the pattern!

All of these examples have a common structure.

What is the underlying structure?

Detect the pattern!

Applications: BEA

What are some good uses of this paradigm at BEA?

[Your Ideas Go Here]


What is it we’re talking about?

Data science has a rather fluffy definition, as it is an interdisciplinary field. But generally, there is agreement that data science sits at the intersection of mathematical inference, computer science, and subject matter expertise.

Venn diagram by Drew Conway

How is data science different from other fields?

Data science uses statistical inference and computational algorithms to develop applications that communicate an actionable insight.

\[\text{Data Science} = f(\text{Statistics, Computer Science})\]

Modern statisticians will claim that data science is no different, but they tend not to operationalize insights. Computer scientists have long developed applications, but are typically less interested in inference.

\[\text{Statistics} \neq \text{Computer Science} \]

Data science is not social science, as it does not rely on formal social theories but rather starts from the first principles of inference from data.

\[\text{Data Science} \neq \text{Social Science} \]

So so so many buzzwords

Data science is also a marketing term and part of a larger universe of buzzwords:


What is data science good for? Where does it fall short?

- Adoption. Yea: embraced mostly in fields where data are expanding rapidly and theory has not yet formed. Nay: data science tends to clash with well-established fields.
- Interpretation. Yea: the underlying skills can be applied flexibly. Nay: there is no gold standard for how they should be applied.
- Staffing need. Yea: very few people are required to do the job well. Nay: the fluffy definition means little quality control over who counts as a data scientist.
- Long-term outlook. Yea: the skills will likely persist into the future. Nay: whereas the early days of data science focused on generalist practitioners, future practitioners will be field-specific, re-absorbed into their host fields.

Why do economists make good data scientists?

Data science has been hyped. It is now very easy to get our hands on more data and more technology, and many practitioners use the tools like throwing spaghetti at the wall and hoping that something sticks.

Why do economists make good data scientists?

Who’s left? Natural and social scientists like economists, who tend to be skeptical and are good at asking a lot of pointed questions.

What makes for a good application?

Applications tend to take on one of three forms:

What we do with the data needs to have a point.

The Dichotomy of Explanations and Predictions

Models 1 and 2 have been estimated for time series \(y\).


Explanations vs. Predictions

The trade-off will more often than not be interpretability versus accuracy.

- Story-driven: the story is derived from regression coefficients, but the model may have low empirical accuracy. These models tend to place greater weight on the variables.
- Model-driven: the predicted number may be very close to reality, but the model does not lend itself to telling a story. These models tend to focus on minimizing error.

What makes for a good application?

Remainder of today will focus on prediction.

How are prediction projects typically structured?

Note: Different companies describe this process in different ways, largely for marketing purposes, but it is effectively the same.

How are prediction projects typically structured?

Recommendation engines

An association score can rank relevant products to facilitate sales. User purchases are cleaned and aggregated into a user-product matrix, then cosine similarity is calculated to find which products are correlated with one another.
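
The pipeline above can be sketched in a few lines of Python. The purchase matrix here is invented for illustration; this is not Amazon's actual system:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy user-product matrix: rows = users, columns = products A, B, C
purchases = [
    [1, 1, 0],  # user 1 bought A and B
    [1, 1, 0],  # user 2 bought A and B
    [0, 0, 1],  # user 3 bought C only
]

# Item-item similarity: compare product columns
cols = list(zip(*purchases))
sim_ab = cosine(cols[0], cols[1])  # A vs B: bought together
sim_ac = cosine(cols[0], cols[2])  # A vs C: never co-occur
print(sim_ab, sim_ac)
```

Products A and B score near 1 because the same users bought both, while A and C score 0; ranking candidate products by this score is the recommendation step.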

How are prediction projects typically structured?

Housing sales price prediction

A predicted price helps set the asking price. Housing sales records are structured into a cross-sectional data set of housing attributes, then a regression is used to correlate house attributes with price in order to predict the prices of unsold houses.
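
A minimal sketch of the same idea in Python, using a single made-up attribute (square footage) and closed-form OLS; a real model would use many attributes:

```python
def fit_simple_ols(x, y):
    """Closed-form OLS for y = b0 + b1*x (one attribute)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
         sum((xi - mx) ** 2 for xi in x)
    b0 = my - b1 * mx
    return b0, b1

# Hypothetical training data: square footage -> sale price (thousands)
sqft  = [1000, 1500, 2000, 2500]
price = [200, 300, 400, 500]  # perfectly linear, for illustration only

b0, b1 = fit_simple_ols(sqft, price)
# Predict the asking price of an unsold 1800 sqft house
print(round(b0 + b1 * 1800))  # → 360
```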

How are prediction projects typically structured?

Job board salaries

Predicted salaries set user expectations and facilitate more targeted search. Job listings with salary data are structured into a cross-sectional data set of job attributes, then a regression is used to correlate job attributes with salary in order to predict salaries for postings that are missing that information.

How are prediction projects typically structured?

Ride share targeting

Past Uber rides are turned into a time series for every half mile grid cell for a given city. A time series model like ARIMA, Neural Net, Theta Algorithm, Holt-Winters or other method is applied to predict the level of expected ridership over the next few hours. This is then used to send alerts to drivers to direct them to hotbeds.
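
As a toy stand-in for ARIMA, Holt-Winters, or a neural net, simple exponential smoothing captures the flavor of the forecasting step. The ridership numbers are invented for illustration:

```python
def ses_forecast(series, alpha=0.5):
    """Simple exponential smoothing:
    level_t = alpha * y_t + (1 - alpha) * level_{t-1}.
    The one-step-ahead forecast is the final smoothed level."""
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level

# Hourly pickups in one hypothetical half-mile grid cell
rides = [120, 130, 125, 140, 150]
print(ses_forecast(rides))  # → 141.25
```

A forecast above some threshold could then trigger the alert that directs drivers to that grid cell.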

How are prediction projects typically structured?

All of these cases use terabytes to petabytes of information for daily production. Steps 2 through 5 are engineered as a program that can run on its own.

Where do we start?

(1) Use Case Statement

Set the initial starting parameters by answering these questions

(1) Use Case Statement: Example

Example using recommendation engines.

\[ \cos(\theta) = \frac{\sum^n_{i=1}A_iB_i}{\sqrt{\sum^n_{i=1}A_i^2}\sqrt{\sum^n_{i=1}B_i^2}} \]

- What are the success criteria? Whether recommended products were clicked on, and whether they were purchased more often than products with lower association scores.

(1) Use Case Statement: Example

Example using NEA QSS.

(1) Contemporaneous Model: \(y_t = f(x_t)\)

(2) Traditional Lag (Momentum): \(y_{t} = f(x_{t-1})\)
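
Constructing the input for the lagged model (2) amounts to pairing each \(y_t\) with \(x_{t-1}\). A quick sketch with hypothetical data:

```python
def make_lagged_pairs(y, x, lag=1):
    """Pair y_t with x_{t-lag} for a momentum-style model y_t = f(x_{t-lag})."""
    return [(y[t], x[t - lag]) for t in range(lag, len(y))]

# Hypothetical target and indicator series
y = [10, 12, 14, 16]
x = [1, 2, 3, 4]
print(make_lagged_pairs(y, x))  # → [(12, 1), (14, 2), (16, 3)]
```

The first observation of \(y\) is dropped because it has no lagged counterpart.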

(2) Data Acquisition

There are very few rules, except that the data must reflect the state of the world at the time of each estimation cycle so that we can simulate forecasting.

Rec Engine

QSS Prediction

(3) Data Engineering

(4) Predictive Modeling

Modeling is typically split between model training and model validation.

(4) Predictive Modeling: Training

Given a target \(y\) and inputs \(X\), training means calibrating a machine learning algorithm to mimic \(y\) subject to a loss function.

The goal is to find the model that produces the lowest loss, which corresponds to the highest accuracy.
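
A sketch of that selection step in Python, with invented predictions from two hypothetical candidate models:

```python
def mse(y_true, y_pred):
    """Mean squared error: the loss we want to minimize."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y = [3.0, 5.0, 7.0]                # observed target values
model_a = [3.1, 4.8, 7.2]          # predictions from candidate model A
model_b = [2.0, 6.0, 9.0]          # predictions from candidate model B

scores = {"A": mse(y, model_a), "B": mse(y, model_b)}
best = min(scores, key=scores.get)  # lowest loss wins
print(best)  # → A
```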

(4) Predictive Modeling: OLS Example

Even regression (\(y_i = \beta_0 + \beta_1 x_i + \epsilon_i\)) can be framed as an iterative process minimizing the mean squared error.
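
A minimal Python sketch of that iterative view: plain gradient descent on the mean squared error. The learning rate and step count are arbitrary choices for this toy data:

```python
def gd_ols(x, y, lr=0.01, steps=5000):
    """Iteratively minimize MSE for y = b0 + b1*x via gradient descent."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(steps):
        # Gradients of MSE with respect to b0 and b1
        g0 = sum((b0 + b1 * xi - yi) for xi, yi in zip(x, y)) * 2 / n
        g1 = sum((b0 + b1 * xi - yi) * xi for xi, yi in zip(x, y)) * 2 / n
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1

x = [0, 1, 2, 3]
y = [1, 3, 5, 7]  # generated from the line y = 1 + 2x
b0, b1 = gd_ols(x, y)
print(round(b0, 2), round(b1, 2))  # → 1.0 2.0
```

After enough steps the estimates converge to the same coefficients the closed-form OLS solution would give.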

https://towardsdatascience.com/linear-regression-the-easier-way-6f941aa471ea


(4) Predictive Modeling: Methods

Modeling in the social sciences tends to focus on parametric methods (e.g., linear regression), as they lend themselves to a story. In data science, the goal is predicting the number, so methods are far more diverse and complex.

Common Econ Methods:

- Linear regression
- Stepwise regression
- ARIMA
- Bayesian Vector Autoregression
- Rolling Average

Data Science Methods:

- Linear regression
- Stepwise regression
- ARIMA
- Bayesian Vector Autoregression
- Regularized regression (LASSO, Ridge, Elastic Net)
- Neural Networks, LSTM
- Support Vector Machines (SVM, SVR)
- Random Forests
- Boosting (Adaptive, Extreme)
- Ensemble averaging and stacking
- +1000’s more

(4) Predictive Modeling - Considering Model Validation Early On

When producing predictions, the data are usually split into a training and testing set.

Why?

https://am207.github.io/2017/wiki/validation.html


(4) Predictive Modeling - Considering Model Validation Early On

Remember, the goal is to find the model with the lowest loss. As the model iteratively learns from the data, it will have seen the data multiple times. It’s like allowing a student to cheat off of every other student in the class.

(4) Predictive Modeling - Considering Model Validation Early On

A student who memorizes test answers without learning the concepts is like a model that learns the data values but not the underlying patterns. This is called overfitting.
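
A toy Python illustration of the student analogy: a “model” that memorizes training answers scores perfectly on data it has seen but fails on data it has not (all numbers invented):

```python
def memorizer(train_x, train_y):
    """A 'student' that memorizes answers: exact lookup of seen x values,
    falling back to the training mean for anything unseen."""
    table = dict(zip(train_x, train_y))
    fallback = sum(train_y) / len(train_y)
    return lambda x: table.get(x, fallback)

train_x, train_y = [1, 2, 3], [2.0, 4.1, 5.9]   # roughly y = 2x
test_x, test_y = [4, 5], [8.0, 10.2]            # unseen data

model = memorizer(train_x, train_y)
train_err = sum((model(x) - y) ** 2 for x, y in zip(train_x, train_y))
test_err = sum((model(x) - y) ** 2 for x, y in zip(test_x, test_y))
print(train_err, test_err)  # zero training error, large test error
```

The perfect training score is exactly why a held-out test set is needed: it exposes that no generalizable pattern was learned.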

(4) Predictive Modeling - Considering Model Validation Early On

Partitioning involves splitting the data into a training set for the student to learn from, then testing it on a previously unseen scenario.
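
A minimal sketch of such a partition in Python (the 25% holdout fraction and fixed seed are arbitrary choices for reproducibility):

```python
import random

def train_test_split(data, test_frac=0.25, seed=42):
    """Shuffle the data, then hold out a fraction the model never
    sees during training."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

data = list(range(20))
train, test = train_test_split(data)
print(len(train), len(test))  # → 15 5
```

No observation appears in both sets, so test performance measures generalization rather than memorization.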

Would you limbo twice?

(4) Predictive Modeling: Example

QSS Prediction: Goal is to predict revenue for each service sector industry

(5) Deployment: Not Rube Goldberg

This sounds too complicated for everyday use.

The beauty of modern technology and techniques is that they can turn tedium into something surefire. The goal is to engineer a machine that produces artisan-quality results at the click of a button.

(5) Deployment: Not Rube Goldberg

We’re really trying to write one program to rule them all.

(5) Deployment: Example

QSS Prediction

rubeGoldberg()

Most data production companies in the private sector, like Bloomberg, tend to be vertically integrated so that economic research products are

Everyone tends to know the same code base (e.g. “We are a Python shop”)

Recap: Can you find a problem you face that fits this framework?

(5 mins)

Part II: Code Refresher

Data science generally relies on one of two coding languages: R and Python.

Today, we will have a quick refresher on R and walk through two short scripts that represent how a basic data science project could work.